Impact of fault prediction on checkpointing strategies

Authors

  • Guillaume Aupy
  • Yves Robert
  • Frédéric Vivien
  • Dounia Zaidouni
Abstract

This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical analysis of Young in the presence of a fault prediction system, which is characterized by its recall (the fraction of all faults that are predicted) and its precision (the fraction of announced faults that are actual faults), and which provides either exact or window-based time predictions. We derive the optimal value of the checkpointing period (thereby minimizing the waste of resource usage due to checkpoint overhead) in all scenarios. These results lay the foundations for future experimental validation of the model.

Key-words: Fault-tolerance, checkpointing, prediction, migration, model, exascale

Affiliations: LIP, École Normale Supérieure de Lyon, France; University of Tennessee Knoxville, USA; Institut Universitaire de France; INRIA
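As a point of reference for the abstract, the sketch below computes the classical baseline that the paper extends: Young's first-order checkpointing period without any fault predictor, together with the recall and precision ratios as the abstract defines them. It is only an illustration under stated assumptions. The function and parameter names are hypothetical, the waste expression is the usual first-order approximation that ignores downtime and recovery costs, and the paper's own prediction-aware formulas are not reproduced here because they do not appear in this excerpt.

```python
import math


def young_period(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order period sqrt(2 * C * mu), with no fault predictor.

    checkpoint_cost (C) and mtbf (mu) must use the same time unit.
    """
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def first_order_waste(period: float, checkpoint_cost: float, mtbf: float) -> float:
    """Assumed first-order waste: checkpoint overhead C/T plus rework T/(2*mu).

    Downtime and recovery time are deliberately left out of this sketch.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


def recall(predicted_faults: int, unpredicted_faults: int) -> float:
    """Fraction of all actual faults that the predictor announced."""
    return predicted_faults / (predicted_faults + unpredicted_faults)


def precision(predicted_faults: int, false_predictions: int) -> float:
    """Fraction of all announcements that corresponded to an actual fault."""
    return predicted_faults / (predicted_faults + false_predictions)


if __name__ == "__main__":
    # Hypothetical platform: 10-minute checkpoints, one fault per day on average.
    c, mu = 600.0, 86400.0
    t = young_period(c, mu)
    print("period (s):", t)
    print("waste at that period:", first_order_waste(t, c, mu))
    print("recall:", recall(8, 2), "precision:", precision(8, 8))
```

Under this approximation the waste is minimized at Young's period, so checkpointing noticeably more or less often than that increases it.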

Similar resources

Checkpointing algorithms and fault prediction

This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly in the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide an optimal algorithm to decide when to take predictions into account, and we derive the optimal value of the checkpoin...

Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid

Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing that periodically ...

Defining the Checkpoint Interval for Uncoordinated Checkpointing Protocols

Parallel applications running on large computers suffer from the absence of a reliable environment. Fault tolerance proposals, in general, rely on rollback-recovery strategies supported by checkpoint and/or message logging. There are well-defined models that address the optimum checkpoint interval for coordinated checkpointing. Nevertheless, there is a lack of models concerning uncoordinated ch...

Energy-efficient Checkpointing in High-throughput Cycle-stealing Distributed Systems

Checkpointing is a fault-tolerance mechanism commonly used in High Throughput Computing (HTC) environments to allow the execution of long-running computational tasks on compute resources subject to hardware or software failures as well as interruptions from resource owners and more important tasks. Until recently many researchers have focused on the performance gains achieved through checkpoint...

On Energy-efficient Checkpointing in High-throughput Cycle-stealing Distributed Systems

Checkpointing is a fault-tolerance mechanism commonly used in High Throughput Computing (HTC) environments to allow the execution of long-running computational tasks on compute resources subject to hardware and software failures and interruptions from resource owners. With increasing scrutiny of the energy consumption of IT infrastructures, it is important to understand the impact of checkpoint...

Journal title:
  • CoRR

Volume abs/1207.6936

Publication date 2012